refactor(BA-5860): Default to sysfs-first CPU/memory stats on native Linux by rapsealk · Pull Request #11223 · lablup/backend.ai

rapsealk · 2026-04-22T04:32:00Z

Closes #11220 (BA-5860)
Refs #11216

Summary

Default CPU/memory stat collection to StatModes.CGROUP on native Linux (sysfs path, ~100-500 ms → <10 ms per sample).
Keep StatModes.DOCKER as the forced path on linuxkit / Docker Desktop and as a per-read fallback when a cgroup sysfs read fails.
Dedup "cgroup sysfs failed, falling back to Docker API" warnings per (plugin, container_id) via a bounded cachetools.LRUCache so a systemic sysfs failure doesn't storm logs.
Replace raise RuntimeError("should not reach here") in exhaustive match defaults with typing.assert_never (satisfies the agent rule banning built-in exceptions in business logic + strengthens mypy exhaustiveness).
No changes to network/IO stat collection.

Operator note — possible metric step-down on upgrade

Hosts previously running stats-type: docker may see:

Memory: reported usage steps down by the inactive-file cache amount (sysfs now matches docker stats CLI semantics).
Block I/O (cgroup v1 only): readings may shift because sysfs reads blkio.throttle.io_service_bytes while the Docker API path reads blkio_stats.io_service_bytes_recursive.

Dashboards and autoscaling thresholds tuned to the old values should be re-evaluated.

Why

Agent-runtime performance epic #11216; sysfs is an order of magnitude faster for CPU/memory stats than the Docker API round-trip.

Test plan

pants test tests/unit/agent:: passes (CPU + Memory plugin fallback, linuxkit guard, warn dedup)
On a native Linux host, confirm CPU/memory stats populate via cgroup path (no Docker API calls for those fields)
Simulate a cgroup read failure; confirm the Docker-API fallback triggers and the one-time warning fires

On Linux native hosts, the sysfs (cgroup) stat collection path is strictly faster (<10 ms) than the Docker API path (100-500 ms), yet the previous default selected the slow path for most cases. - Change the `container.stats-type` default from `docker` to unset so that the agent auto-selects the fastest available path (`cgroup` on native Linux, `docker` elsewhere). - Wrap the cgroup CPU and memory readers with a per-read Docker API fallback: when a sysfs read returns None (missing path or unreadable), emit a rate-limited warning and fall back to the Docker API instead of silently dropping the sample. - Force the Docker API path on linuxkit / Docker Desktop hosts even when `StatModes.CGROUP` is selected, since cgroup paths are not reliably accessible there. Network and block I/O stats continue to use the Docker API path. `StatModes.DOCKER` remains available as an explicit config override. Closes #11220 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR changes the agent’s default container stats collection strategy so that native Linux hosts prefer the faster cgroup/sysfs path for CPU/memory, while keeping Docker API usage for linuxkit/Docker Desktop and as a runtime fallback when cgroup reads fail.

Changes:

Switch stats_type default from docker to “auto” (unset/None) so StatContext selects the preferred mode per host.
Add per-container cgroup→Docker-API fallback behavior (and one-time warning) for CPU and memory stats collection; force API on linuxkit.
Add unit tests covering CPU fallback-on-failure and linuxkit forcing API.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
`tests/unit/agent/test_docker_intrinsic.py`	Adds CPUPlugin unit tests for cgroup sysfs failure fallback and linuxkit forcing API.
`src/ai/backend/agent/docker/intrinsic.py`	Implements cgroup-first fallback logic for CPU/Memory and linuxkit forcing API; adds warning suppression helper.
`src/ai/backend/agent/config/unified.py`	Changes `stats_type` default to `None` (auto-select) and updates help text/validator typing.
`changes/11220.enhance.md`	Adds changelog entry for the new default behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-22T04:37:04Z

+# Tracks containers for which we have already logged a cgroup->Docker-API
+# fallback warning, to avoid log spam on persistent read failures.
+_cgroup_fallback_warned: set[str] = set()
+
+
+def _warn_cgroup_fallback_once(plugin: str, container_id: str) -> None:
+    key = f"{plugin}:{container_id}"
+    if key in _cgroup_fallback_warned:
+        return
+    _cgroup_fallback_warned.add(key)
+    log.warning(
+        "{0}: cgroup sysfs read failed for container {1}; falling back to Docker API",
+        plugin,
+        container_id[:7],
+    )


_cgroup_fallback_warned is a process-global set that only grows and is never pruned. On agents with high container churn this can become an unbounded memory sink over time. Consider bounding it (e.g., LRU/TTL cache) or clearing entries when containers are removed / after some time window.

Copilot · 2026-04-22T04:37:05Z

+def _warn_cgroup_fallback_once(plugin: str, container_id: str) -> None:
+    key = f"{plugin}:{container_id}"
+    if key in _cgroup_fallback_warned:
+        return
+    _cgroup_fallback_warned.add(key)
+    log.warning(
+        "{0}: cgroup sysfs read failed for container {1}; falling back to Docker API",
+        plugin,
+        container_id[:7],
+    )
+


_warn_cgroup_fallback_once() is meant to prevent log spam on persistent sysfs failures, but sysfs_impl() already logs a warning on every OSError before returning None. With the new fallback, persistent failures will still spam the existing sysfs warning (and add one more warning once). Consider moving the per-container warning solely to the fallback layer (or otherwise suppressing/reducing repeated sysfs warnings) so the anti-spam behavior is effective.

Copilot · 2026-04-22T04:37:05Z

                "Method for collecting container resource statistics. "
                "'docker' uses Docker's stats API (most compatible). "
-                "'cgroup' reads from cgroup filesystem directly (more accurate, requires root). "
-                "'null' disables statistics collection."
+                "'cgroup' reads from cgroup filesystem directly (faster and more accurate, "
+                "requires root on Linux). "
+                "When unset, the agent auto-selects: cgroup on native Linux hosts "
+                "(with Docker API fallback on read failure) and docker elsewhere."


The PR description says there are no changes to network/IO stat collection, but switching the default stats_type to auto-select cgroup on native Linux will cause MemoryPlugin.gather_container_measures() to use its sysfs implementation by default (which collects io/net via cgroup + /proc), rather than Docker stats API. If the intent is to keep network/IO coming from the Docker stats API by default, the mode selection likely needs to be split (CPU/memory via cgroup, net/IO via API) or the PR description should be updated to reflect the behavior change.

Copilot · 2026-04-22T04:37:05Z

+        async def cgroup_first_impl(
+            container_id: str,
+        ) -> tuple[int, int, int, int, int, int, int] | None:
+            result = await sysfs_impl(container_id)
+            if result is None:
+                _warn_cgroup_fallback_once("MemoryPlugin", container_id)
+                return await api_impl(container_id)
+            return result
+
+        match ctx.mode:
+            case StatModes.CGROUP if not _is_linuxkit(self.local_config):
+                impl = cgroup_first_impl
+            case StatModes.CGROUP | StatModes.DOCKER:
+                impl = api_impl
+            case _:
+                raise RuntimeError("should not reach here")


MemoryPlugin now has a cgroup-first fallback-to-API path (and linuxkit forcing API) similar to CPUPlugin, but the new behavior isn't covered by unit tests (unlike CPU where new tests were added). Please add tests that (1) verify MemoryPlugin falls back to fetch_api_stats() when sysfs_impl() returns None and (2) verify linuxkit forces API even when ctx.mode is CGROUP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Extend changelog to call out the memory metric shift (sysfs excludes inactive file cache; dashboards may step-down). - Bound the _cgroup_fallback_warned set to prevent unbounded growth on hosts with high container churn. - Replace RuntimeError("should not reach here") with assert_never() to satisfy the BackendAIError-only rule and strengthen exhaustive match-statement type checking. - Add parallel MemoryPlugin fallback + linuxkit tests and a dedup-warn test. Refs #11220 Refs #11223 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rapsealk · 2026-04-22T06:04:58Z

Review summary from an independent pass + resolutions pushed to this branch (commit 9cc2b7e71).

Verdict: ship with changes — addressed.

Finding	Resolution
Memory metric will step-down on upgrade (sysfs excludes inactive-file cache; matches `docker stats` CLI) — not called out for operators	`changes/11223.enhance.md` extended to warn that dashboards and autoscaling thresholds tuned to the previous `stats-type: docker` values may need re-tuning
`_cgroup_fallback_warned: set[str]` grew unbounded on container churn	Replaced with `cachetools.LRUCache(maxsize=1024)` — same pattern used elsewhere in the agent
`raise RuntimeError("should not reach here")` at two `match ctx.mode:` sites violates the `BackendAIError`-only rule	Both sites now use `assert_never(ctx.mode)`, which also strengthens mypy exhaustiveness
No mirror tests for `MemoryPlugin` fallback / linuxkit guard; no dedup-warn tests	Added 4 tests: mirror CPU fallback + linuxkit tests for MemoryPlugin, plus `TestWarnCgroupFallbackOnce` covering dedup and LRU eviction

Cgroup v1/v2 both handled (pre-existing). Network/IO still goes through Docker API per spec. Selection via StatModes.get_preferred_mode() correctly picks up is_containerized() — agents running inside a container stay on DOCKER mode. Cgroup-first wraps sysfs with a per-read Docker-API fallback, so transient sysfs errors downgrade gracefully rather than silently zeroing metrics. Linuxkit guard belt-and-suspenders against Docker Desktop / colima.

_is_linuxkit() wrapped a `==` comparison in bool(...), which is a no-op since == already returns bool. Refs #11220 Refs #11223 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Extend changes/11223.enhance.md with a short note that block-I/O readings may also step down on cgroup v1 hosts (the sysfs path reads blkio.throttle.io_service_bytes; the API path reads io_service_bytes_recursive). - Drop TestWarnCgroupFallbackOnce::test_evicts_beyond_limit — it was re-verifying cachetools.LRUCache eviction, not project logic. test_deduplicates_per_container already covers the project contract. Refs #11220 Refs #11223 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Add TODO(#11223) notes on CPU/Memory api_impl paths: after sysfs-first lands, the Docker stream on these plugins is only needed for network and blkio (which neither consumes), so the stream should migrate to a network/IO consumer. - Add TODO(#11232) at the two per-plugin DockerStatsStreamer instantiation sites flagging the shared-streamer refactor. - Add a test covering the DockerError 404 branch of _read_stream: reader exits cleanly on container-gone without spinning retries. Refs #11219 Refs #11223 Refs #11224 Refs #11232 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rapsealk added this to the 26.5 milestone Apr 22, 2026

Copilot AI review requested due to automatic review settings April 22, 2026 04:32

github-actions Bot assigned rapsealk Apr 22, 2026

github-actions Bot added size:L 100~500 LoC comp:agent Related to Agent component labels Apr 22, 2026

Copilot started reviewing on behalf of rapsealk April 22, 2026 04:32 View session

rapsealk changed the title ~~enhance(agent): default to sysfs-first CPU/memory stats on native Linux~~ refactor(agent): Default to sysfs-first CPU/memory stats on native Linux Apr 22, 2026

Copilot AI reviewed Apr 22, 2026

View reviewed changes

rapsealk and others added 2 commits April 22, 2026 14:11

chore(changes): Rename news fragment to PR number

b4fe2ce

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rapsealk mentioned this pull request Apr 22, 2026

refactor(BA-5859): Stream container stats from Docker instead of polling #11224

Open

4 tasks

rapsealk requested review from a team and achimnol April 22, 2026 06:17

refactor(agent): Drop redundant bool() wrapper around equality check

99d5145

_is_linuxkit() wrapped a `==` comparison in bool(...), which is a no-op since == already returns bool. Refs #11220 Refs #11223 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rapsealk mentioned this pull request Apr 22, 2026

Share a single DockerStatsStreamer across CPU/Memory intrinsic plugins #11232

Open

rapsealk changed the title ~~refactor(agent): Default to sysfs-first CPU/memory stats on native Linux~~ refactor(BA-5860): Default to sysfs-first CPU/memory stats on native Linux Apr 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(BA-5860): Default to sysfs-first CPU/memory stats on native Linux#11223

refactor(BA-5860): Default to sysfs-first CPU/memory stats on native Linux#11223
rapsealk wants to merge 5 commits into
mainfrom
perf/11220-sysfs-first-stats

rapsealk commented Apr 22, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

rapsealk commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rapsealk commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Operator note — possible metric step-down on upgrade

Why

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

rapsealk commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rapsealk commented Apr 22, 2026 •

edited

Loading